In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(font_scale=2)
sns.set_style("whitegrid")

FIFA Dataset

We will be looking at the FIFA 2018 Dataset. While this is a video game, the developers strive to make their game as accurate as possible, so this data reflects the skills of the real-life players.

Let's load the data frame using pandas.

In [2]:
df = pd.read_csv("FIFA_2018.csv",encoding = "ISO-8859-1",index_col = 0, low_memory = False)

We can take a brief look at the data by calling df.head(). The first 34 columns are attributes describing each player's behavior (e.g. aggression) or skills (e.g. ball control). The final four columns show the player's position, name, nationality, and the club they play for.

The four positions are forward (FWD), midfielder (MID), defender (DEF), and goalkeeper (GK).

In [3]:
df.head()
Out[3]:
Acceleration Aggression Agility Balance Ball control Composure Crossing Curve Dribbling Finishing ... Sprint speed Stamina Standing tackle Strength Vision Volleys Position Name Nationality Club
0 89 63 89 63 93 95 85 81 91 94 ... 91 92 31 80 85 88 FWD Cristiano Ronaldo Portugal Real Madrid CF
1 92 48 90 95 95 96 77 89 97 95 ... 87 73 28 59 90 85 FWD L. Messi Argentina FC Barcelona
2 94 56 96 82 95 92 75 81 96 89 ... 90 78 24 53 80 83 FWD Neymar Brazil Paris Saint-Germain
3 88 78 86 60 91 83 77 86 86 94 ... 77 89 45 80 84 88 FWD L. Suárez Uruguay FC Barcelona
4 58 29 52 35 48 70 15 14 30 13 ... 61 44 10 83 70 11 GK M. Neuer Germany FC Bayern Munich

5 rows × 38 columns

We already know that identifying goalkeepers is quite straightforward, so let's remove the rows corresponding to goalkeepers, along with the goalkeeper-specific attribute columns:

In [4]:
df2 = df[df["Position"] != "GK"].copy()
df2.drop(columns=['GK diving',
                  'GK handling',
                  'GK kicking',
                  'GK positioning',
                  'GK reflexes'], inplace=True)

We can get all the attribute names and store them as labels using .columns.values.
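The cells that build the feature matrix X and the target vector Y are not shown in this excerpt. A minimal sketch of how they could be constructed from the attribute labels follows; the tiny example frame here is a stand-in for df2, not the real dataset (which has 33 attribute columns after the goalkeeper drop):

```python
import pandas as pd

# Stand-in for df2: two attribute columns plus the four descriptive columns.
df2 = pd.DataFrame({
    "Acceleration": [89, 92, 58],
    "Aggression":   [63, 48, 29],
    "Position":     ["FWD", "FWD", "DEF"],
    "Name":         ["A", "B", "C"],
    "Nationality":  ["X", "Y", "Z"],
    "Club":         ["P", "Q", "R"],
})

# Attribute names: every column except the final four descriptive ones.
labels = df2.columns.values[:-4]

X = df2[labels].values      # feature matrix: one row of attributes per player
Y = df2["Position"].values  # target: the position label to predict
```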

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score,  precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
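The cell defining dict_classifiers is missing from this excerpt. A plausible reconstruction, mapping display names to the imported estimators, is sketched below; the exact hyperparameters are assumptions (defaults, plus a raised max_iter so LogisticRegression converges), and the Random Forest entry implies an import not shown above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Assumed reconstruction of the missing cell: names as printed in the
# results, each mapped to an estimator with (mostly) default settings.
dict_classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
```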
In [8]:
validation_size = 0.3
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
In [10]:
print('%30s  %16s' % ("Classifier","accuracy") )
for name, clf in list(dict_classifiers.items()):
    
    clf.fit(X_train, Y_train)
    y_result = clf.predict(X_test)
    acc = accuracy_score(Y_test, y_result)
    print('%30s  %16f' % (name, acc) )
    cmat  = confusion_matrix(Y_test, y_result,labels=["DEF","MID","FWD"])
    print(cmat)
                    Classifier          accuracy
             Nearest Neighbors          0.770999
[[1376  240   14]
 [ 342 1608  231]
 [   7  262  706]]
                           LDA          0.790430
[[1395  231    4]
 [ 308 1632  241]
 [   7  212  756]]
  Gradient Boosting Classifier          0.797743
[[1430  194    6]
 [ 326 1685  170]
 [   7  265  703]]
                 Random Forest          0.711241
[[1258  338   34]
 [ 398 1486  297]
 [  20  295  660]]
                   Naive Bayes          0.726285
[[1290  333    7]
 [ 404 1343  434]
 [   5  127  843]]
            Logistic Regression          0.792938
[[1405  220    5]
 [ 305 1668  208]
 [   8  245  722]]
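Accuracy alone hides per-class behavior, and precision_score and recall_score were imported above but never used. Per-class precision and recall can also be read directly off a confusion matrix (rows are true classes, columns are predicted classes); the sketch below does this for the Gradient Boosting matrix from the output above:

```python
import numpy as np

# Gradient Boosting confusion matrix from the run above,
# rows = true class, columns = predicted class, order ["DEF", "MID", "FWD"].
cmat = np.array([[1430,  194,    6],
                 [ 326, 1685,  170],
                 [   7,  265,  703]])

recall = cmat.diagonal() / cmat.sum(axis=1)     # per class: TP / (TP + FN)
precision = cmat.diagonal() / cmat.sum(axis=0)  # per class: TP / (TP + FP)

for cls, p, r in zip(["DEF", "MID", "FWD"], precision, recall):
    print("%4s  precision %.3f  recall %.3f" % (cls, p, r))
```

For example, forwards are recalled least often here (703 of 975 true FWD rows are predicted correctly), mostly through confusion with midfielders, which matches the large MID/FWD off-diagonal entries.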